
feat: add two-phase extension upgrade with spec.schemaVersion #328

Draft
WentingWu666666 wants to merge 13 commits into documentdb:main from WentingWu666666:developer/two-phase-extension-upgrade

Conversation

@WentingWu666666
Collaborator

Summary

Add spec.schemaVersion field to DocumentDBSpec to decouple binary (image) upgrades from schema (ALTER EXTENSION) upgrades. This provides a rollback-safe window between deploying a new binary and committing the schema change.

Closes #271

Three Modes

| spec.schemaVersion | Behavior | Use Case |
| --- | --- | --- |
| Not set (default) | Two-phase: schema stays at current version until user explicitly sets schemaVersion | Production: manual control, rollback safety |
| "auto" | Auto-finalize: schema updates to match binary version automatically | Dev/test: simple, no rollback safety |
| Explicit version (e.g. "0.112.0") | Schema updates to exactly that version (must be <= binary) | Pin to specific version |
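
The three modes above can be read as a small decision function. The following is a minimal sketch, not the code from this PR: `resolveSchemaTarget`, `compareVersions`, and `parseVersion` are hypothetical names, versions are assumed to be plain `MAJOR.MINOR.PATCH` strings, and the sketch assumes the schema is only ever moved forward (a target at or below the installed version is treated as a no-op).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits "MAJOR.MINOR.PATCH" into three non-negative integers.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(v, ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("invalid version %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return out, fmt.Errorf("invalid version %q", v)
		}
		out[i] = n
	}
	return out, nil
}

// compareVersions returns -1, 0, or 1 when a is older than, equal to, or newer than b.
func compareVersions(a, b string) (int, error) {
	pa, err := parseVersion(a)
	if err != nil {
		return 0, err
	}
	pb, err := parseVersion(b)
	if err != nil {
		return 0, err
	}
	for i := range pa {
		switch {
		case pa[i] < pb[i]:
			return -1, nil
		case pa[i] > pb[i]:
			return 1, nil
		}
	}
	return 0, nil
}

// resolveSchemaTarget (hypothetical) maps spec.schemaVersion to the schema
// version that should be installed and reports whether ALTER EXTENSION
// would need to run to get there.
func resolveSchemaTarget(schemaVersion, binaryVersion, installedVersion string) (string, bool, error) {
	var target string
	switch schemaVersion {
	case "":
		// Two-phase default: leave the schema where it is.
		return installedVersion, false, nil
	case "auto":
		// Auto-finalize: follow the binary version.
		target = binaryVersion
	default:
		// Explicit pin: must not exceed the binary version.
		cmp, err := compareVersions(schemaVersion, binaryVersion)
		if err != nil {
			return "", false, err
		}
		if cmp > 0 {
			return "", false, fmt.Errorf("schemaVersion %s exceeds binary version %s", schemaVersion, binaryVersion)
		}
		target = schemaVersion
	}
	// Assumption: never move the schema backwards.
	cmp, err := compareVersions(target, installedVersion)
	if err != nil {
		return "", false, err
	}
	if cmp <= 0 {
		return installedVersion, false, nil
	}
	return target, true, nil
}

func main() {
	for _, sv := range []string{"", "auto", "0.112.0", "0.113.0"} {
		target, run, err := resolveSchemaTarget(sv, "0.112.0", "0.111.0")
		fmt.Printf("schemaVersion=%q target=%q alter=%v err=%v\n", sv, target, run, err)
	}
}
```

With a binary at 0.112.0 and an installed schema at 0.111.0, the unset case leaves the schema untouched, while "auto" and the explicit "0.112.0" pin both request an upgrade; a pin above the binary version is rejected.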

User Flow (Two-Phase)

# Step 1: Update binary only (rollback-safe)
spec:
  documentDBVersion: "0.112.0"
  # schemaVersion: not set -> schema stays at old version

# Step 2: After validation, finalize schema
spec:
  documentDBVersion: "0.112.0"
  schemaVersion: "0.112.0"   # triggers ALTER EXTENSION UPDATE

Changes

  • api/preview/documentdb_types.go: Add SchemaVersion to DocumentDBSpec with validation pattern
  • internal/controller/documentdb_controller.go: Add determineSchemaTarget(), modify upgradeDocumentDBIfNeeded() to gate ALTER EXTENSION on spec.schemaVersion
  • internal/utils/util.go: Add SemverToExtensionVersion() inverse conversion
  • CRDs: Regenerated in both config/crd/ and helm-chart/crds/
  • Tests: 5 new tests for two-phase modes + 3 updated existing tests
  • Docs: Updated upgrades.md with Schema Version Control section
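
To illustrate the inverse conversion mentioned for internal/utils/util.go, here is a rough sketch of a round trip between a semver string and an extension version string. It assumes the extension version uses a `MAJOR.MINOR-PATCH` form such as `0.112-0`; that format, and the function names `semverToExtension` and `extensionToSemver`, are assumptions made for illustration and are not taken from this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// semverToExtension converts "0.112.0" to "0.112-0".
// The target format is an assumption, not confirmed by the PR text.
func semverToExtension(v string) (string, error) {
	if strings.Count(v, ".") != 2 {
		return "", fmt.Errorf("expected MAJOR.MINOR.PATCH, got %q", v)
	}
	i := strings.LastIndex(v, ".")
	return v[:i] + "-" + v[i+1:], nil
}

// extensionToSemver converts "0.112-0" back to "0.112.0".
func extensionToSemver(v string) (string, error) {
	i := strings.LastIndex(v, "-")
	if i < 0 || strings.Count(v[:i], ".") != 1 {
		return "", fmt.Errorf("expected MAJOR.MINOR-PATCH, got %q", v)
	}
	return v[:i] + "." + v[i+1:], nil
}

func main() {
	ext, err := semverToExtension("0.112.0")
	fmt.Println(ext, err)
	sem, err := extensionToSemver(ext)
	fmt.Println(sem, err)
}
```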

Backward Compatibility

Breaking change: Default behavior changes from auto-update to two-phase. Existing users upgrading to this operator version will need to set schemaVersion: "auto" to restore previous behavior, or adopt two-phase upgrades. This is intentional -- safe by default for a database operator.

WentingWu666666 force-pushed the developer/two-phase-extension-upgrade branch 3 times, most recently from a3bdb5c to c0d35b1 on March 26, 2026 17:50
Add spec.schemaVersion field to DocumentDBSpec to decouple binary (image)
upgrades from schema (ALTER EXTENSION) upgrades. This provides a rollback-safe
window between deploying a new binary and committing the schema change.

Three modes:
- Empty (default): Two-phase mode. Schema stays at current version until user
  explicitly sets schemaVersion. Safe by default for production.
- "auto": Auto-finalize. Schema updates to match binary version automatically.
  Simple mode for development and testing.
- Explicit version: Schema updates to exactly that version. Must be <= binary.

Changes:
- api/preview/documentdb_types.go: Add SchemaVersion to DocumentDBSpec
- internal/controller/documentdb_controller.go: Add determineSchemaTarget()
  function, modify upgradeDocumentDBIfNeeded() to gate ALTER EXTENSION on
  spec.schemaVersion value
- internal/utils/util.go: Add SemverToExtensionVersion() inverse conversion
- Regenerated CRDs (config/crd + helm chart)
- Added unit tests for all three modes and edge cases
- Created public upgrade documentation (docs/operator-public-documentation/)
- Added Upgrading DocumentDB page to mkdocs.yml navigation

Closes documentdb#271

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
WentingWu666666 force-pushed the developer/two-phase-extension-upgrade branch from c0d35b1 to 525700b on March 26, 2026 18:08
wentingwu000 and others added 12 commits March 26, 2026 14:47
Add a ValidatingWebhookConfiguration that enforces:
- schemaVersion must be <= binary version (on create and update)
- Image rollback below installed schema version is blocked (on update)

Components added:
- internal/webhook/documentdb_webhook.go: ValidateCreate/ValidateUpdate handlers
- internal/webhook/documentdb_webhook_test.go: 18 unit tests
- Helm template 10_documentdb_webhook.yaml: Issuer, Certificate, Service,
  ValidatingWebhookConfiguration with cert-manager CA injection
- Updated 09_documentdb_operator.yaml: webhook port, cert volume mount, args
- Updated cmd/main.go: webhook registration

The webhook runs inside the existing operator process on port 9443 using
cert-manager for TLS (same pattern as the sidecar injector). failurePolicy
is set to Fail for database safety. The controller retains defense-in-depth
checks as a secondary safety net.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Restructure webhook to use two validation function slices following the
CNPG ClusterCustomValidator pattern:

- validate(db): spec-level rules run on both create and update
  Contains: validateSchemaVersionNotExceedsBinary
- validateChanges(new, old): update-only rules comparing old vs new
  Contains: validateImageRollback

This makes it easy to add new validation rules: just append a function
to the appropriate slice. Each validator has a consistent signature
returning field.ErrorList, and errors from all validators are aggregated.

Also adds var _ webhook.CustomValidator = &DocumentDBValidator{} compile
check and uses apierrors.NewInvalid for proper Kubernetes error format.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
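
The two-slice validator layout described in the commit message above can be sketched without any Kubernetes dependencies. This is an illustrative approximation only: `documentDB` and `fieldError` are hypothetical stand-ins for the real DocumentDB type and `field.ErrorList`, the `InstalledSchemaVersion` field is an assumed place to read the installed schema version from, and the bodies of the two validators are guesses based on the rule descriptions rather than the PR's code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// documentDB is a trimmed, hypothetical stand-in for the DocumentDB resource.
type documentDB struct {
	DocumentDBVersion      string // spec.documentDBVersion (binary)
	SchemaVersion          string // spec.schemaVersion: "", "auto", or a version
	InstalledSchemaVersion string // assumed to come from status
}

// fieldError is a stand-in for one entry of field.ErrorList.
type fieldError struct {
	Field  string
	Detail string
}

// parse splits "MAJOR.MINOR.PATCH" into integers; ok is false if malformed.
func parse(v string) ([3]int, bool) {
	var out [3]int
	parts := strings.Split(v, ".")
	if len(parts) != 3 {
		return out, false
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, false
		}
		out[i] = n
	}
	return out, true
}

// newer reports whether a is strictly newer than b; malformed input yields false.
func newer(a, b string) bool {
	pa, okA := parse(a)
	pb, okB := parse(b)
	if !okA || !okB {
		return false
	}
	for i := range pa {
		if pa[i] != pb[i] {
			return pa[i] > pb[i]
		}
	}
	return false
}

func validateSchemaVersionNotExceedsBinary(db documentDB) []fieldError {
	if db.SchemaVersion == "" || db.SchemaVersion == "auto" {
		return nil
	}
	if newer(db.SchemaVersion, db.DocumentDBVersion) {
		return []fieldError{{"spec.schemaVersion", "must be <= spec.documentDBVersion"}}
	}
	return nil
}

func validateImageRollback(newDB, oldDB documentDB) []fieldError {
	if newer(oldDB.InstalledSchemaVersion, newDB.DocumentDBVersion) {
		return []fieldError{{"spec.documentDBVersion", "cannot roll back below installed schema version"}}
	}
	return nil
}

// Spec-level rules run on create and update; change rules run on update only.
var validate = []func(documentDB) []fieldError{validateSchemaVersionNotExceedsBinary}
var validateChanges = []func(newDB, oldDB documentDB) []fieldError{validateImageRollback}

func validateCreate(db documentDB) []fieldError {
	var errs []fieldError
	for _, v := range validate {
		errs = append(errs, v(db)...)
	}
	return errs
}

func validateUpdate(newDB, oldDB documentDB) []fieldError {
	errs := validateCreate(newDB)
	for _, v := range validateChanges {
		errs = append(errs, v(newDB, oldDB)...)
	}
	return errs
}

func main() {
	oldDB := documentDB{"0.112.0", "0.112.0", "0.112.0"}
	newDB := documentDB{"0.111.0", "0.112.0", "0.112.0"}
	for _, e := range validateUpdate(newDB, oldDB) {
		fmt.Printf("%s: %s\n", e.Field, e.Detail)
	}
}
```

Adding a rule in this layout means writing one more function with the matching signature and appending it to `validate` or `validateChanges`; errors from every rule are collected rather than stopping at the first failure.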
- Remove Step 1b (image rollback blocking) from controller; webhook handles it
- Simplify determineSchemaTarget: keep lightweight defense-in-depth guard
- Rename Pg-suffixed variables to full names (schemaExtensionVersion, etc.)
- Refactor webhook tests to Ginkgo/Gomega (matching CNPG and controller patterns)
- Add suite_test.go for webhook package

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
- Reframe upgrade types as Operator (control plane) vs DocumentDB (data plane)
- Explain documentDBVersion vs schemaVersion relationship clearly
- Reorganize data plane upgrade as step-by-step walkthrough with production/dev tabs
- Add Multi-Region Upgrades section with coordination guidance
- Move Advanced Image Overrides to its own section

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CNPG default is in-place restart. Switchover promotes a replica first,
then restarts the old primary as replica, minimizing write downtime.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

Implement safety-gap pattern for rollback safety by decoupling image update from schema update
